Abstract

We introduce a model-free algorithm for learning in Markov
decision processes with parameterized actions: discrete actions
with continuous parameters. At each step the agent must
select both which action to use and which parameters to use
with that action. We introduce the Q-PAMDP algorithm for
learning in these domains, show that it converges to a local
optimum, and compare it to direct policy search in the
goal-scoring and Platform domains.


1 Introduction 

Reinforcement learning agents typically have either a discrete
or a continuous action space (Sutton and Barto 1998).
With a discrete action space, the agent decides which distinct
action to perform from a finite action set. With a continuous
action space, actions are expressed as a single real-valued
vector. If we use a continuous action space, we lose the ability
to consider differences in kind: all actions must be expressible
as a single vector. If we use only discrete actions,
we lose the ability to finely tune action selection based on
the current state.

A parameterized action is a discrete action parameterized
by a real-valued vector. Modeling actions this way introduces
structure into the action space by treating different
kinds of continuous actions as distinct. At each step an agent
must choose both which action to use and what parameters
to execute it with. For example, consider a soccer-playing
robot which can kick, pass, or run. We can associate a continuous
parameter vector with each of these actions: we can
kick the ball to a given target position with a given force,
pass to a specific position, and run with a given velocity.
Each of these actions is parameterized in its own way.
Parameterized action Markov decision processes (PAMDPs)
model situations where we have distinct actions that require
parameters to adjust the action to different situations, or
where there are multiple mutually incompatible continuous
actions.
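As a minimal sketch of the soccer example above, a parameterized action can be represented as a discrete label paired with a parameter vector whose length depends on the label. The class name, the helper `make_action`, and the parameter dimensions in `PARAM_DIMS` are illustrative assumptions, not part of the formalism.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ParameterizedAction:
    """A discrete action label paired with its continuous parameter vector."""
    name: str
    params: Tuple[float, ...]


# Hypothetical parameter dimensions for the soccer example:
# kick = target (x, y) plus force, pass = target (x, y), run = velocity (vx, vy).
PARAM_DIMS: Dict[str, int] = {"kick": 3, "pass": 2, "run": 2}


def make_action(name: str, *params: float) -> ParameterizedAction:
    """Build an action, checking that the parameters fit that action's own space."""
    if name not in PARAM_DIMS:
        raise ValueError(f"unknown discrete action: {name!r}")
    if len(params) != PARAM_DIMS[name]:
        raise ValueError(
            f"{name!r} expects {PARAM_DIMS[name]} parameters, got {len(params)}"
        )
    return ParameterizedAction(name, tuple(float(p) for p in params))


kick = make_action("kick", 10.0, 2.5, 0.8)
run = make_action("run", 1.0, 0.0)
```

Each discrete action keeps its own parameter space, so a kick and a run are never forced into one shared vector.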

We focus on how to learn an action-selection policy
given pre-defined parameterized actions. We introduce the
Q-PAMDP algorithm, which alternates between learning
action-selection and parameter-selection policies, and compare
it to a direct policy search method. We show that with appropriate
update rules Q-PAMDP converges to a local optimum.
These methods are compared empirically in the goal and
Platform domains. We found that Q-PAMDP outperformed
direct policy search and fixed parameter SARSA.


2 Background 

A Markov decision process (MDP) is a tuple $\langle S, A, P, R, \gamma \rangle$,
where $S$ is a set of states, $A$ is a set of actions, $P(s, a, s')$ is
the probability of transitioning to state $s'$ from state $s$ after
taking action $a$, $R(s, a, r)$ is the probability of receiving
reward $r$ for taking action $a$ in state $s$, and $\gamma$ is a discount factor
(Sutton and Barto 1998). We wish to find a policy, $\pi(a|s)$,
which selects an action for each state so as to maximize the
expected sum of discounted rewards (the return).

The value function $V^\pi(s)$ is defined as the expected
discounted return achieved by policy $\pi$ starting at state $s$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right].$$

Similarly, the action-value function is given by

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[r_0 + \gamma V^\pi(s')\right],$$

as the expected return obtained by taking action $a$ in state
$s$, and then following policy $\pi$ thereafter. Using the
value function for control requires a model, and we would prefer
to act without one. We can approach
this problem by learning $Q$, which allows us to directly select
the action which maximizes $Q^\pi(s, a)$. We can learn $Q$
for an optimal policy using a method such as Q-learning
(Watkins and Dayan 1992). In domains with a continuous
state space, we can represent $Q(s, a)$ using parametric function
approximation with a set of parameters $\omega$ and learn this
with algorithms such as gradient descent SARSA($\lambda$)
(Sutton and Barto 1998).
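A minimal sketch of one gradient-descent SARSA(λ) step for a linear action-value function $Q(s, a) = \omega \cdot \phi(s, a)$ may help make this concrete. It uses accumulating eligibility traces in the textbook form; the function name, argument layout, and the choice of accumulating traces are assumptions made for illustration.

```python
from typing import List, Tuple


def sarsa_lambda_step(
    w: List[float],
    z: List[float],
    phi: List[float],
    phi_next: List[float],
    reward: float,
    alpha: float,
    gamma: float,
    lam: float,
    terminal: bool,
) -> Tuple[List[float], List[float]]:
    """Return updated (weights, traces) after observing (s, a, r, s', a')."""
    q = sum(wi * fi for wi, fi in zip(w, phi))
    # The value of a terminal successor is defined to be zero.
    q_next = 0.0 if terminal else sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = reward + gamma * q_next - q  # temporal-difference error
    # Accumulating traces: decay, then add the gradient of Q w.r.t. the weights,
    # which for a linear approximator is just the feature vector.
    z = [gamma * lam * zi + fi for zi, fi in zip(z, phi)]
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z


w, z = sarsa_lambda_step(
    w=[0.0, 0.0], z=[0.0, 0.0],
    phi=[1.0, 0.0], phi_next=[0.0, 1.0],
    reward=1.0, alpha=0.5, gamma=0.9, lam=0.8, terminal=False,
)
```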


For problems with a continuous action space ($A \subseteq \mathbb{R}^m$),
selecting the optimal action with respect to $Q(s, a)$ is non-trivial,
as it requires finding a global maximum for a function
in a continuous space. We can avoid this problem using
a policy search algorithm, where a class of policies parameterized
by a set of parameters $\theta$ is given, which transforms
the problem into one of direct optimization over $\theta$ for an
objective function $J(\theta)$. Several policy search approaches
exist, including policy gradient methods, entropy-based approaches,
path integral approaches, and sample-based approaches
(Deisenroth, Neumann, and Peters 2013).

Figure 1: Three types of action spaces: discrete, continuous,
and parameterized. (a) The discrete action space consists of a
finite set of distinct actions. (b) The continuous action space
is a single continuous real-valued space. (c) The parameterized
action space has multiple discrete actions, each of which has a
continuous parameter space.

Parameterized Tasks 

A parameterized task is a problem defined by a task parameter
vector $\tau$ given at the beginning of each episode. These
parameters are fixed throughout an episode, and the goal is to
learn a task dependent policy. Kober et al. (2012) developed
algorithms to adjust motor primitives to different task
parameters. They apply this to learn table-tennis and darts with
different starting positions and targets. Da Silva et al. (2012)
introduced the idea of a parameterized skill as a task dependent
parameterized policy. They sample a set of tasks, learn
their associated parameters, and determine a mapping from
task to policy parameters. Deisenroth et al. (2014) applied
a model-based method to learn a task dependent parameterized
policy. This is used to learn task dependent policies
for a ball-hitting task, and for solving a block manipulation
problem. Parameterized tasks can be used as parameterized
actions. For example, if we learn a parameterized task for
kicking a ball to position $\tau$, this could be used as a
parameterized action kick-to($\tau$).

3 Parameterized Action MDPs 

We consider MDPs where the state space is continuous
($S \subseteq \mathbb{R}^n$) and the actions are parameterized: there is a finite set of
discrete actions $A_d = \{a_1, a_2, \ldots, a_k\}$, and each $a \in A_d$
has a set of continuous parameters $X_a \subseteq \mathbb{R}^{m_a}$. An action
is a tuple $(a, x)$ where $a$ is a discrete action and $x$ are the
parameters for that action. The action space is then given by

$$A = \bigcup_{a \in A_d} \{(a, x) \mid x \in X_a\},$$

which is the union of each discrete action with all possible
parameters for that action. We refer to such MDPs as
parameterized action MDPs (PAMDPs). Figure 1 depicts the
different action spaces.

We apply a two-tiered approach for action selection: first
selecting the parameterized action, then selecting the parameters
for that action. The discrete-action policy is denoted
$\pi^d(a|s)$. To select the parameters for the action, we define
the action-parameter policy for each action $a$ as $\pi^a(x|s)$.
The policy is then given by

$$\pi(a, x|s) = \pi^d(a|s)\,\pi^a(x|s).$$

In other words, to select a complete action $(a, x)$, we sample
a discrete action $a$ from $\pi^d(a|s)$ and then sample a parameter
$x$ from $\pi^a(x|s)$. The action policy is defined by the
parameters $\omega$ and is denoted by $\pi^d_\omega(a|s)$. The action-parameter
policy for action $a$ is determined by a set of parameters $\theta_a$,
and is denoted $\pi^a_\theta(x|s)$. The set of these parameters is given
by $\theta = [\theta_{a_1}, \ldots, \theta_{a_k}]$.
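The two-tiered sampling step can be sketched as follows, assuming a soft-max over per-action values for the discrete tier and an isotropic Gaussian around a linear mean for the parameter tier. The function names, the dictionary layout of `theta`, and the single scalar `sigma` are illustrative assumptions rather than a prescribed implementation.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def softmax(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable soft-max over a list of action values."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(
    features: Sequence[float],
    q_values: Dict[str, float],
    theta: Dict[str, List[List[float]]],
    sigma: float,
    rng: random.Random,
) -> Tuple[str, List[float]]:
    """Sample a discrete action a, then its parameters x given a."""
    names = list(q_values)
    probs = softmax([q_values[a] for a in names])
    a = rng.choices(names, weights=probs, k=1)[0]
    # theta[a] holds one weight row per parameter dimension of action a.
    means = [sum(w * f for w, f in zip(row, features)) for row in theta[a]]
    x = [rng.gauss(mu, sigma) for mu in means]
    return a, x


theta = {"kick-to": [[0.0, 1.0], [1.0, 0.0]], "shoot-goal": [[0.5, 0.5]]}
q_values = {"kick-to": 0.2, "shoot-goal": 0.1}
action, params = sample_action([1.0, 0.5], q_values, theta, 0.1, random.Random(0))
```

The number of sampled parameters follows from the chosen discrete action, which is the factorization $\pi(a, x|s) = \pi^d(a|s)\,\pi^a(x|s)$ in code form.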

The first approach we consider is direct policy search. We
use a direct policy search method to optimize the objective
function

$$J(\theta, \omega) = \mathbb{E}_{s_0 \sim D}\left[V^{\pi}(s_0)\right]$$

with respect to $(\theta, \omega)$, where $\pi$ is the policy defined by
$(\theta, \omega)$ and $s_0$ is a state sampled according
to the state distribution $D$. $J$ is the expected return for a
given policy starting at an initial state.

Our second approach is to alternate updating the
parameter-policy and learning an action-value function for
the discrete actions. For any PAMDP $M = \langle S, A, P, R, \gamma \rangle$
with a fixed parameter-policy $\pi^a_\theta$, there exists a corresponding
discrete action MDP, $M_\theta = \langle S, A_d, P_\theta, R_\theta, \gamma \rangle$, where

$$P_\theta(s'|s, a) = \int_{x \in X_a} \pi^a_\theta(x|s)\, P(s'|s, a, x)\, dx,$$

$$R_\theta(r|s, a) = \int_{x \in X_a} \pi^a_\theta(x|s)\, R(r|s, a, x)\, dx.$$
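In a model-free setting the integrals above are never computed; sampling a parameter from the fixed parameter-policy and then stepping the underlying PAMDP draws a transition from the induced discrete-action MDP. The sketch below wraps a step function this way. The names `make_induced_step`, `toy_step`, and `toy_param_policy` and the one-dimensional toy dynamics are assumptions made for illustration.

```python
import random
from typing import Callable, Tuple

State = float
StepFn = Callable[[State, str, float], Tuple[State, float]]
ParamPolicy = Callable[[State, str, random.Random], float]


def make_induced_step(
    step: StepFn, param_policy: ParamPolicy, rng: random.Random
) -> Callable[[State, str], Tuple[State, float]]:
    """Turn a PAMDP step(s, a, x) into a discrete-action step(s, a)."""
    def induced_step(state: State, a: str) -> Tuple[State, float]:
        x = param_policy(state, a, rng)  # x ~ pi_theta(. | s) for action a
        return step(state, a, x)         # (s', r) from the underlying PAMDP
    return induced_step


def toy_step(state: State, a: str, x: float) -> Tuple[State, float]:
    """Hypothetical 1-D dynamics: move by x, reward is minus distance to 0."""
    next_state = state + x
    return next_state, -abs(next_state)


def toy_param_policy(state: State, a: str, rng: random.Random) -> float:
    """Hypothetical Gaussian parameter-policy with zero spread for clarity."""
    mean = 1.0 if a == "right" else -1.0
    return rng.gauss(mean, 0.0)


induced = make_induced_step(toy_step, toy_param_policy, random.Random(0))
next_state, reward = induced(0.0, "right")
```

A discrete-action learner such as SARSA can then interact with `induced` as if the parameters did not exist.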

We represent the action-value function for $M_\theta$ using function
approximation with parameters $\omega$. For $M_\theta$, there exists
an optimal set of representation weights $\omega^*_\theta$ which maximizes
$J(\theta, \omega)$ with respect to $\omega$. Let

$$W(\theta) = \arg\max_{\omega} J(\theta, \omega) = \omega^*_\theta.$$

We can learn $W(\theta)$ for a fixed $\theta$ using a Q-learning algorithm.
Finally, we define for fixed $\omega$,

$$J_\omega(\theta) = J(\theta, \omega),$$

$$H(\theta) = J(\theta, W(\theta)).$$

$H(\theta)$ is the performance of the best discrete policy for a
fixed $\theta$.













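Algorithm 1 Q-PAMDP($k$)

Input:
    Initial parameters $\theta_0$, $\omega_0$
    Parameter update method P-UPDATE
    Q-learning algorithm Q-LEARN

Algorithm:
    $\omega \leftarrow$ Q-LEARN$^{(\infty)}(M_\theta, \omega_0)$
    repeat
        $\theta \leftarrow$ P-UPDATE$^{(k)}(J_\omega, \theta)$
        $\omega \leftarrow$ Q-LEARN$^{(\infty)}(M_\theta, \omega)$
    until $\theta$ converges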


Algorithm 1 describes a method for alternately updating
$\theta$ and $\omega$. The algorithm takes two input methods, P-UPDATE
and Q-LEARN, and a positive integer parameter $k$, which
determines the number of updates to $\theta$ for each iteration.
P-UPDATE($f$, $\theta$) should be a policy search method that optimizes
$\theta$ with respect to objective function $f$. Q-LEARN
can be any algorithm for Q-learning with function approximation.
We consider two main cases of the Q-PAMDP algorithm:
Q-PAMDP(1) and Q-PAMDP($\infty$).

Q-PAMDP(1) performs a single update of $\theta$ and then
relearns $\omega$ to convergence. If at each step we only update $\theta$
once, and then update $\omega$ until convergence, we can optimize
$\theta$ with respect to $H$. In the next section we show that if we
can find a local optimum $\theta$ with respect to $H$, then we have
found a local optimum with respect to $J$.
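The alternating loop can be sketched in a few lines. This is only an illustration under strong assumptions: $\theta$ and $\omega$ are scalars, `p_update` receives the current $\omega$ in place of the objective $J_\omega$, `q_learn` is assumed to return a converged $\omega$, and the toy objective $J(\theta, \omega) = -(\theta - 1)^2 - (\omega - \theta)^2$ with its two helper functions is invented for the example.

```python
from typing import Callable, Tuple


def q_pamdp(
    theta0: float,
    omega0: float,
    p_update: Callable[[float, float], float],
    q_learn: Callable[[float, float], float],
    k: int = 1,
    tol: float = 1e-8,
    max_iters: int = 10_000,
) -> Tuple[float, float]:
    """Alternate k parameter updates with relearning omega to convergence."""
    theta = theta0
    omega = q_learn(theta, omega0)
    for _ in range(max_iters):
        new_theta = theta
        for _ in range(k):  # k applications of P-UPDATE with omega held fixed
            new_theta = p_update(new_theta, omega)
        omega = q_learn(new_theta, omega)  # relearn omega for the new theta
        if abs(new_theta - theta) < tol:
            return new_theta, omega
        theta = new_theta
    return theta, omega


def toy_q_learn(theta: float, omega: float) -> float:
    """Exact argmax over omega of the toy J: W(theta) = theta."""
    return theta


def toy_p_update(theta: float, omega: float, alpha: float = 0.1) -> float:
    """One gradient ascent step on the toy J with omega held fixed."""
    grad = -2.0 * (theta - 1.0) + 2.0 * (omega - theta)
    return theta + alpha * grad


theta_star, omega_star = q_pamdp(0.0, 0.0, toy_p_update, toy_q_learn, k=1)
```

With `k=1` this corresponds to Q-PAMDP(1); a large `k` approximates the full optimization of $\theta$ used by Q-PAMDP($\infty$).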

4 Theoretical Results 

We now show that Q-PAMDP(1) converges to a local or
global optimum with mild assumptions. We assume that iterating
P-UPDATE converges to some $\theta^*$ with respect to a
given objective function $f$. As the P-UPDATE step is a design
choice, it can be selected with the appropriate convergence
property. Q-PAMDP(1) is equivalent to the sequence

$$\omega_t = W(\theta_t),$$

$$\theta_{t+1} = \text{P-UPDATE}(J_{\omega_t}, \theta_t),$$

if Q-LEARN converges to $W(\theta)$ for each given $\theta$.

Theorem 4.1 (Convergence to a Local Optimum). For any
$\theta_0$, if the sequence

$$\theta_{t+1} = \text{P-UPDATE}(H, \theta_t), \qquad (1)$$

converges to a local optimum with respect to $H$, then
Q-PAMDP(1) converges to a local optimum with respect to $J$.

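Proof. By definition of the sequence above $\omega_t = W(\theta_t)$, so
it follows that

$$J_{\omega_t}(\theta_t) = J(\theta_t, W(\theta_t)) = H(\theta_t).$$

In other words, the objective function $J$ equals $H$ if $\omega = W(\theta)$.
Therefore, we can replace $J$ with $H$ in our update
for $\theta$, to obtain the update rule

$$\theta_{t+1} = \text{P-UPDATE}(H, \theta_t).$$

Therefore by equation 1 the sequence $\theta_t$ converges to a local
optimum $\theta^*$ with respect to $H$. Let $\omega^* = W(\theta^*)$. As $\theta^*$ is a
local optimum with respect to $H$, by definition there exists
$\epsilon > 0$, s.t.

$$\|\theta^* - \theta\|_2 < \epsilon \implies H(\theta) \le H(\theta^*).$$

Therefore for any $\omega$,

$$\left\| \begin{pmatrix} \theta^* \\ \omega^* \end{pmatrix} - \begin{pmatrix} \theta \\ \omega \end{pmatrix} \right\|_2 < \epsilon
\implies \|\theta^* - \theta\|_2 < \epsilon
\implies H(\theta) \le H(\theta^*)
\implies J(\theta, \omega) \le J(\theta^*, \omega^*).$$

The last implication holds because $J(\theta, \omega) \le J(\theta, W(\theta)) = H(\theta)$
by definition of $W$, and $H(\theta^*) = J(\theta^*, \omega^*)$.
Therefore $(\theta^*, \omega^*)$ is a local optimum with respect to $J$. □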

In summary, if we can locally optimize $\theta$, and $\omega = W(\theta)$
at each step, then we will find a local optimum for $J(\theta, \omega)$.
The conditions for the previous theorem can be met by assuming
that P-UPDATE is a local optimization method such
as a gradient-based policy search. A similar argument shows
that if the sequence $\theta_t$ converges to a global optimum with
respect to $H$, then Q-PAMDP(1) converges to a global optimum
$(\theta^*, \omega^*)$.

One problem is that at each step we must relearn $W(\theta)$
for the updated value of $\theta$. We now show that if updates to $\theta$
are bounded and $W(\theta)$ is a continuous function, then the required
updates to $\omega$ will also be bounded. Intuitively, we are
supposing that a small update to $\theta$ results in a small change
in the weights specifying which discrete action to choose.
The assumption that $W(\theta)$ is continuous is strong, and may
not be satisfied by all PAMDPs. It is not necessary for the
operation of Q-PAMDP(1), but when it is satisfied we do
not need to completely relearn $\omega$ after each update to $\theta$. We
show that by selecting an appropriate $\alpha$ we can shrink the
differences in $\omega$ as desired.


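Theorem 4.2 (Bounded Updates to $\omega$). If $W$ is continuous
with respect to $\theta$, and updates to $\theta$ are of the form

$$\theta_{t+1} = \theta_t + \alpha_t\, \text{P-UPDATE}(\theta_t, \omega_t),$$

with the norm of each P-UPDATE bounded by

$$0 < \|\text{P-UPDATE}(\theta_t, \omega_t)\|_2 < \delta,$$

for some $\delta > 0$, then for any difference in $\omega$, $\epsilon > 0$, there is
an initial update rate $\alpha_0 > 0$ such that

$$\alpha_t < \alpha_0 \implies \|\omega_{t+1} - \omega_t\|_2 < \epsilon.$$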

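Proof. Let $\epsilon > 0$ and

$$\alpha_0 = \frac{\delta}{\|\text{P-UPDATE}(\theta_t, \omega_t)\|_2}.$$

As $\alpha_t < \alpha_0$, it follows that

$$\delta > \alpha_t \|\text{P-UPDATE}(\theta_t, \omega_t)\|_2
= \|\alpha_t\, \text{P-UPDATE}(\theta_t, \omega_t)\|_2
= \|\theta_{t+1} - \theta_t\|_2.$$

So we have

$$\|\theta_{t+1} - \theta_t\|_2 < \delta.$$

As $W$ is continuous, this means that

$$\|W(\theta_{t+1}) - W(\theta_t)\|_2 = \|\omega_{t+1} - \omega_t\|_2 < \epsilon.$$

□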










In other words, if our update to $\theta$ is bounded and $W$ is
continuous, we can always adjust the learning rate $\alpha$ so that
the difference between $\omega_t$ and $\omega_{t+1}$ is bounded.

With Q-PAMDP(1) we want P-UPDATE to optimize
$H(\theta)$. One logical choice would be to use a gradient update.
The next theorem shows that the gradient of $H$ is equal to the
gradient of $J$ if $\omega = W(\theta)$. This is useful as we can apply
existing gradient-based policy search methods to compute
the gradient of $J$ with respect to $\theta$. The proof follows from
the fact that we are at a global optimum of $J$ with respect to
$\omega$, and so the gradient $\nabla_\omega J$ is zero. This theorem requires
that $W$ is differentiable (and therefore also continuous).

Theorem 4.3 (Gradient of $H(\theta)$). If $J(\theta, \omega)$ is differentiable
with respect to $\theta$ and $\omega$ and $W(\theta)$ is differentiable with
respect to $\theta$, then the gradient of $H$ is given by
$\nabla_\theta H(\theta) = \nabla_\theta J(\theta, \omega^*)$, where $\omega^* = W(\theta)$.

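Proof. If $\theta \in \mathbb{R}^n$ and $\omega \in \mathbb{R}^m$ then we can compute the
gradient of $H$ by the chain rule:

$$\frac{\partial H(\theta)}{\partial \theta_i}
= \frac{\partial J(\theta, W(\theta))}{\partial \theta_i}
= \sum_{j=1}^{n} \frac{\partial J(\theta, \omega^*)}{\partial \theta_j} \frac{\partial \theta_j}{\partial \theta_i}
+ \sum_{k=1}^{m} \frac{\partial J(\theta, \omega^*)}{\partial \omega^*_k} \frac{\partial \omega^*_k}{\partial \theta_i}
= \frac{\partial J(\theta, \omega^*)}{\partial \theta_i}
+ \sum_{k=1}^{m} \frac{\partial J(\theta, \omega^*)}{\partial \omega^*_k} \frac{\partial \omega^*_k}{\partial \theta_i},$$

where $\omega^* = W(\theta)$. Note that as, by definition of $W$,

$$\omega^* = W(\theta) = \arg\max_{\omega} J(\theta, \omega),$$

we have that the gradient of $J$ with respect to $\omega$ is zero,
$\partial J(\theta, \omega^*) / \partial \omega^*_k = 0$, as $\omega^*$ is a global maximum with respect
to $J$ for fixed $\theta$. Therefore, we have that

$$\nabla_\theta H(\theta) = \nabla_\theta J(\theta, \omega^*).$$

□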
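The gradient identity can be checked numerically on a toy objective. The sketch below assumes scalar $\theta$ and $\omega$ and the invented objective $J(\theta, \omega) = -(\omega - 2\theta)^2 - (\theta - 1)^2$, for which $W(\theta) = 2\theta$ in closed form; it compares a finite-difference derivative of $H$ with the partial derivative of $J$ in $\theta$ at fixed $\omega^* = W(\theta)$.

```python
from typing import Callable


def J(theta: float, omega: float) -> float:
    """Toy objective (an assumption for this check, not a real return)."""
    return -(omega - 2.0 * theta) ** 2 - (theta - 1.0) ** 2


def W(theta: float) -> float:
    """Closed-form argmax over omega of the toy J."""
    return 2.0 * theta


def H(theta: float) -> float:
    """Performance of the best omega for a given theta."""
    return J(theta, W(theta))


def central_diff(f: Callable[[float], float], x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


theta = 0.3
omega_star = W(theta)
grad_H = central_diff(H, theta)
# Partial derivative of J in theta with omega frozen at omega_star.
grad_J_theta = central_diff(lambda t: J(t, omega_star), theta)
```

Both quantities agree up to floating-point error, even though $W(\theta)$ itself changes with $\theta$, because the $\omega$-gradient of $J$ vanishes at $\omega^*$.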

To summarize, if $W(\theta)$ is continuous and P-UPDATE
converges to a global or local optimum, then Q-PAMDP(1)
will converge to a global or local optimum, respectively, and
the Q-LEARN step will be bounded if the update rate of
the P-UPDATE step is bounded. As such, if P-UPDATE is a
policy gradient update step then by Theorem 4.1 Q-PAMDP(1)
will converge to a local optimum, and by Theorem 4.2 the
Q-LEARN step will require a fixed number of updates. By
Theorem 4.3, this policy gradient step can use the gradient of
$J$ with respect to $\theta$.

With Q-PAMDP($\infty$) each step performs a full optimization
of $\theta$ and then a full optimization of $\omega$. The $\theta$ step would
optimize $J(\theta, \omega)$, not $H(\theta)$, as we do not update $\omega$ while we
update $\theta$. Q-PAMDP($\infty$) has the disadvantage of requiring
global convergence properties for the P-UPDATE method.

Theorem 4.4 (Local Convergence of Q-PAMDP($\infty$)). If at
each step of Q-PAMDP($\infty$), for some bounded set $\Theta$,

$$\theta_{t+1} = \arg\max_{\theta \in \Theta} J(\theta, \omega_t),$$

$$\omega_{t+1} = W(\theta_{t+1}),$$

then Q-PAMDP($\infty$) converges to a local optimum.


Proof. By definition of $W$, $\omega_{t+1} = \arg\max_{\omega} J(\theta_{t+1}, \omega)$.
Therefore this algorithm takes the form of direct alternating
optimization. As such, it converges to a local optimum
(Bezdek and Hathaway 2002). □


Q-PAMDP($\infty$) has weaker convergence properties than
Q-PAMDP(1), as it requires a globally convergent
P-UPDATE. However, it has the potential to bypass nearby
local optima (Bezdek and Hathaway 2002).
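The alternating structure of Q-PAMDP($\infty$) can be illustrated with exact coordinate-wise maximization on a toy objective. Everything here is an assumption for the sake of the sketch: scalar $\theta$ and $\omega$, the objective $J(\theta, \omega) = -(\theta - 1)^2 - (\omega - \theta)^2$, and closed-form solutions for both maximization steps.

```python
from typing import Tuple


def argmax_theta(omega: float) -> float:
    """Exact maximizer of the toy J over theta for fixed omega.

    Setting dJ/dtheta = -2(theta - 1) + 2(omega - theta) to zero
    gives theta = (1 + omega) / 2.
    """
    return (1.0 + omega) / 2.0


def W(theta: float) -> float:
    """Exact maximizer of the toy J over omega for fixed theta."""
    return theta


def alternating_maximization(
    omega0: float, tol: float = 1e-10, max_iters: int = 1000
) -> Tuple[float, float]:
    """Alternate full optimization of theta and of omega until theta settles."""
    omega = omega0
    theta = argmax_theta(omega)
    for _ in range(max_iters):
        omega = W(theta)
        new_theta = argmax_theta(omega)
        if abs(new_theta - theta) < tol:
            return new_theta, W(new_theta)
        theta = new_theta
    return theta, omega


theta_inf, omega_inf = alternating_maximization(0.0)
```

On this concave toy problem the iterates approach $(1, 1)$, the unique maximum; on a real PAMDP each step would be a full policy search or Q-learning run rather than a closed-form solve.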

5 Experiments 


We first consider a simplified robot soccer problem (Kitano
et al. 1997) where a single striker attempts to score a goal
against a keeper. Each episode starts with the player at a
random position along the bottom bound of the field. The
player starts with the ball in possession, and the keeper is
positioned between the ball and the goal. The game takes
place in a 2D environment where the player and the keeper
have a position, velocity and orientation and the ball has a
position and velocity, resulting in 14 continuous state variables.

An episode ends when the keeper possesses the ball, the
player scores a goal, or the ball leaves the field. The reward
for an action is 0 for a non-terminal state, 50 for a terminal
goal state, and $-d$ for a terminal non-goal state, where $d$
is the distance of the ball to the goal. The player has two
parameterized actions: kick-to($x$, $y$), which kicks the ball
towards position $(x, y)$; and shoot-goal($h$), which shoots the
ball towards a position $h$ along the goal line. Noise is added
to each action. If the player is not in possession of the ball, it
moves towards it. The keeper has a fixed policy: it moves
towards the ball, and if the player shoots at the goal, the keeper
moves to intercept the ball.
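The reward rule described above can be written as a small function. This is a sketch under assumptions: the function name and arguments are invented, and the distance $d$ is measured to a single reference point on the goal, which the text does not specify.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def goal_reward(terminal: bool, scored: bool, ball: Point, goal: Point) -> float:
    """Reward for one step of the goal domain.

    0 for a non-terminal state, 50 for a terminal goal state, and minus the
    ball-to-goal distance for any other terminal state. `goal` is an assumed
    reference point on the goal (for example its centre).
    """
    if not terminal:
        return 0.0
    if scored:
        return 50.0
    return -math.dist(ball, goal)


step_reward = goal_reward(False, False, (0.0, 0.0), (3.0, 4.0))
miss_reward = goal_reward(True, False, (0.0, 0.0), (3.0, 4.0))
score_reward = goal_reward(True, True, (3.0, 4.0), (3.0, 4.0))
```

Because a miss is penalized by distance, episodes that end closer to the goal are preferred even when no goal is scored.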

To score a goal, the player must shoot around the keeper.
This means that at some positions it must shoot left past the
keeper, and at others to the right past the keeper. However, at
no point should it shoot at the keeper, so an optimal policy
is discontinuous. We split the shoot-goal action into two
parameterized actions: shoot-goal-left and shoot-goal-right.
This allows us to use a simple action selection policy instead
of a complex continuous action policy. Such a policy would be
difficult to represent in a purely continuous action space, but
is simpler in a parameterized action setting.

We represent the action-value function for the discrete action
$a$ using linear function approximation with Fourier basis
features (Konidaris, Osentoski, and Thomas 2011). As
we have 14 state variables, we must be selective in which
basis functions to use. We only use basis functions with
two non-zero elements and exclude all velocity state variables.
We use the soft-max discrete action policy (Sutton
and Barto 1998). We represent the action-parameter policy
$\pi^a_\theta$ as a normal distribution around a weighted sum of
features, $\pi^a_\theta(x|s) = \mathcal{N}(\theta_a^\top \psi_a(s), \Sigma)$, where $\theta_a$ is a
matrix of weights, $\psi_a(s)$ gives the features for state $s$,
and $\Sigma$ is a fixed covariance matrix. We use specialized features
for each action. For the shoot-goal actions we use
a simple linear basis $(1, g)$, where $g$ is the projection
of the keeper onto the goal line. For kick-to we use linear
features $(1, b_x, b_y, b_x^2, b_y^2, (b_x - k_x)/\|b - k\|_2, (b_y - k_y)/\|b - k\|_2)$,
where $(b_x, b_y)$ is the position of the ball
and $(k_x, k_y)$ is the position of the keeper.
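One way to keep a Fourier basis small over many state variables is to enumerate only coefficient vectors with few non-zero entries. The sketch below reads "two non-zero elements" as "at most two", includes the constant feature, and assumes states scaled to $[0, 1]$; the function names and the `order` parameter are illustrative, and the exclusion of velocity variables is not modelled.

```python
import math
from itertools import combinations, product
from typing import List, Sequence, Tuple


def sparse_fourier_coefficients(n_vars: int, order: int) -> List[Tuple[int, ...]]:
    """Coefficient vectors in {0..order}^n_vars with at most two non-zero entries."""
    coeffs = [tuple([0] * n_vars)]  # constant feature
    for i in range(n_vars):  # exactly one non-zero entry
        for c in range(1, order + 1):
            vec = [0] * n_vars
            vec[i] = c
            coeffs.append(tuple(vec))
    for i, j in combinations(range(n_vars), 2):  # exactly two non-zero entries
        for ci, cj in product(range(1, order + 1), repeat=2):
            vec = [0] * n_vars
            vec[i], vec[j] = ci, cj
            coeffs.append(tuple(vec))
    return coeffs


def fourier_features(
    state: Sequence[float], coeffs: Sequence[Tuple[int, ...]]
) -> List[float]:
    """Fourier basis features cos(pi * c . s) for a state scaled to [0, 1]."""
    return [
        math.cos(math.pi * sum(c * s for c, s in zip(vec, state)))
        for vec in coeffs
    ]


coeffs = sparse_fourier_coefficients(n_vars=3, order=2)
features = fourier_features([0.0, 0.0, 0.0], coeffs)
```

The number of features grows quadratically in the number of variables under this restriction, instead of exponentially as with the full basis.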

For the direct policy search approach, we use the episodic
natural actor critic (eNAC) algorithm (Peters and Schaal
2008), computing the gradient of $J(\omega, \theta)$ with respect to
$(\omega, \theta)$. For the Q-PAMDP approach we use the
gradient-descent Sarsa($\lambda$) algorithm for Q-learning, and the eNAC
algorithm for policy search. At each step we perform one
eNAC update based on 50 episodes and then refit $Q_\omega$ using
50 gradient descent Sarsa($\lambda$) episodes.


Figure 2: Average goal scoring probability, averaged over
20 runs for Q-PAMDP(1), Q-PAMDP($\infty$), fixed parameter
Sarsa, and eNAC in the goal domain. Intervals show standard
error.

Return is directly correlated with goal scoring probability,
so their graphs are close to identical. As it is easier to
interpret, we plot goal scoring probability in Figure 2. We can
see that direct eNAC is outperformed by Q-PAMDP(1) and
Q-PAMDP($\infty$). This is likely due to the difficulty of optimizing
the action selection parameters directly, rather than
with Q-learning.

For both methods, the goal probability is greatly increased:
while the initial policy rarely scores a goal, both
Q-PAMDP(1) and Q-PAMDP($\infty$) increase the probability
of a goal to roughly 35%. Direct eNAC converged to a local
maximum of 15%. Finally, we include the performance
of SARSA($\lambda$) where the action parameters are fixed at
the initial $\theta_0$. This achieves roughly 20% scoring probability.
Both Q-PAMDP(1) and Q-PAMDP($\infty$) strongly outperform
fixed parameter SARSA, but eNAC does not. Figure 3
depicts a single episode using a converged Q-PAMDP(1)
policy: the player draws the keeper out and strikes when
the goal is open.

Next we consider the Platform domain, where the agent
starts on a platform and must reach a goal while avoiding
enemies. If the agent reaches the goal platform, touches an
enemy, or falls into a gap between platforms, the episode
ends. This domain is depicted in Figure 4. The reward for a
step is the change in $x$ value for that step, divided by the total
length of all the platforms and gaps. The agent has two
primitive actions: run or jump, which continue for a fixed period
or until the agent lands again respectively. There are two
different kinds of jumps: a high jump to get over enemies,
and a long jump to get over gaps between platforms. The
domain therefore has three parameterized actions: run($dx$),
hop($dx$), and leap($dx$). The agent only takes actions while
on the ground, and enemies only move when the agent is
on their platform. The state space consists of four variables
$(x, \dot{x}, e_x, \dot{e}_x)$, representing the agent position, agent speed,
enemy position, and enemy speed respectively. For learning
$Q_\omega$, as in the previous domain, we use linear function
approximation with the Fourier basis. We apply a softmax discrete
action policy based on $Q_\omega$, and a Gaussian parameter
policy based on scaled parameter features $\psi_a(s)$.

Figure 3: A robot soccer goal episode using a converged
Q-PAMDP(1) policy. The player runs to one side, then shoots
immediately upon overtaking the keeper.

Figure 4: A screenshot from the Platform domain. The
player hops over an enemy, and then leaps over a gap.

Figure 5 shows the performance of eNAC, Q-PAMDP(1),
Q-PAMDP($\infty$), and SARSA with fixed parameters. Both
Q-PAMDP(1) and Q-PAMDP($\infty$) outperformed the fixed
parameter SARSA method, reaching on average 50% and
65% of the total distance respectively. We suggest that
Q-PAMDP($\infty$) outperforms Q-PAMDP(1) due to the nature
of the Platform domain. Q-PAMDP(1) is best suited to domains
with smooth changes in the action-value function with
respect to changes in the parameter-policy. With the Platform
domain, our initial policy is unable to make the first
jump without modification. When the policy can reach the
second platform, we need to drastically change the action-value
function to account for this platform. Therefore,
Q-PAMDP(1) may be poorly suited to this domain as the small
change in parameters that occurs between failing to make
the jump and actually making it results in a large change
in the action-value function. Both Q-PAMDP variants are better
than the fixed SARSA baseline of 40%, and much better than
direct optimization using eNAC, which reached 10%. Figure 6
shows a successfully completed episode of the Platform domain.

Figure 5: Average percentage distance covered, averaged
over 20 runs for Q-PAMDP(1), Q-PAMDP($\infty$), and eNAC
in the Platform domain. Intervals show standard error.

Figure 6: A successful episode of the Platform domain.
The agent hops over the enemies, leaps over the gaps, and
reaches the last platform.


6 Related Work 


Hauskrecht et al. (2004) introduced an algorithm for solving
factored MDPs with a hybrid discrete-continuous action
space. However, their formalism has an action space with a
mixed set of discrete and continuous components, whereas
our domain has distinct actions with a different number of
continuous components for each action. Furthermore, they
assume the domain has a compact factored representation,
and only consider planning.

Rachelson (2009) encountered parameterized actions in
the form of an action to wait for a given period of time
in his research on time dependent, continuous time MDPs
(TMDPs). He developed XMDPs, which are TMDPs with
a parameterized action space (Rachelson 2009). He developed
a Bellman operator for this domain, and in a later paper
mentions that the TMDPpoly algorithm can work with
parameterized actions, although this specifically refers to the
parameterized wait action (Rachelson, Fabiani, and Garcia
2009). This research also takes a planning perspective, and
only considers a time dependent domain. Additionally, the
size of the parameter space for the parameterized actions is
the same for all actions.


Hoey et al. (2013) considered mixed discrete-continuous
actions in their work on Bayesian affect control theory. To
approach this problem they use a form of POMCP, a Monte
Carlo sampling algorithm, using domain specific adjustments
to compute the continuous action components (Silver
and Veness 2010). They note that the discrete and continuous
components of the action space reflect different control
aspects: the discrete control provides the "what", while
the continuous control describes the "how" (Hoey, Schroder,
and Alhothali 2013).


In their research on symbolic dynamic programming
(SDP) algorithms, Zamani et al. (2012) considered domains
with a set of discrete parameterized actions. Each of these
actions has a different parameter space. Symbolic dynamic
programming is a form of planning for relational or
first-order MDPs, where the MDP has a set of logical relationships
defining its dynamics and reward function. Their algorithms
represent the value function as an extended algebraic
decision diagram (XADD), and are limited to MDPs with
predefined logical relations.


A hierarchical MDP is an MDP where each action has
subtasks. A subtask is itself an MDP with its own states
and actions which may have their own subtasks. Hierarchical
MDPs are well-suited for representing parameterized actions
as we could consider selecting the parameters for a discrete
action as a subtask. MAXQ is a method for value function
decomposition of hierarchical MDPs (Dietterich 2000).
One possibility is to use MAXQ for learning the action-values
in a parameterized action problem.


7 Conclusion 

The PAMDP formalism models reinforcement learning domains
with parameterized actions. Parameterized actions
give us the adaptability of continuous domains and the ability
to use distinct kinds of actions. They also allow for simple
representation of discontinuous policies without complex
parameterizations. We have presented three approaches for
model-free learning in PAMDPs: direct optimization and two
variants of the Q-PAMDP algorithm. We have shown that
Q-PAMDP(1), with an appropriate P-UPDATE method, converges
to a local or global optimum. Q-PAMDP($\infty$) with a
global optimization step converges to a local optimum.

We have examined the performance of these approaches
in the goal scoring domain and the Platform domain.
The robot soccer goal domain models the situation where a
striker must out-maneuver a keeper to score a goal. In this
domain, Q-PAMDP(1) and Q-PAMDP($\infty$) outperformed eNAC and
fixed parameter SARSA. Q-PAMDP(1) and Q-PAMDP($\infty$)
performed similarly well in terms of goal scoring, learning
policies that score goals roughly 35% of the time. In the
Platform domain we found that both Q-PAMDP(1) and
Q-PAMDP($\infty$) outperformed eNAC and fixed SARSA.